Data Fusion of Deep Learned Molecular Embeddings for Property Prediction

Appleton, Robert J, Barnes, Brian C, Strachan, Alejandro

arXiv.org Artificial Intelligence

Data-driven approaches such as deep learning can result in predictive models for material properties with exceptional accuracy and efficiency. However, in many applications, data is sparse, severely limiting their accuracy and applicability. To improve predictions, techniques such as transfer learning and multi-task learning have been used. The performance of multi-task learning models depends on the strength of the underlying correlations between tasks and the completeness of the dataset. Standard multi-task models tend to underperform when trained on sparse datasets with weakly correlated properties. To address this gap, we fuse deep-learned embeddings generated by independent pre-trained single-task models, resulting in a multi-task model that inherits rich, property-specific representations. By re-using (rather than re-training) these embeddings, the resulting fused model outperforms standard multi-task models and can be extended with fewer trainable parameters. We demonstrate this technique on a widely used benchmark dataset of quantum chemistry data for small molecules as well as a newly compiled sparse dataset of experimental data collected from literature and our own quantum chemistry and thermochemical calculations.
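The fusion idea, re-using frozen single-task embeddings and training only a small multi-task head on top, can be sketched as follows. This is a minimal numpy illustration under assumed choices: random stand-in encoders, fusion by concatenation, and a least-squares linear head, none of which are the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, d_a, d_b = 16, 8, 8
W_a = rng.normal(size=(n_features, d_a))
W_b = rng.normal(size=(n_features, d_b))

def pretrained_embed_a(x):
    # Stand-in for a frozen, pre-trained single-task encoder (property A).
    return np.tanh(x @ W_a)

def pretrained_embed_b(x):
    # Stand-in for a frozen, pre-trained single-task encoder (property B).
    return np.tanh(x @ W_b)

def fused_features(x):
    # Re-use (rather than re-train) both embeddings by concatenation.
    return np.concatenate([pretrained_embed_a(x), pretrained_embed_b(x)], axis=-1)

# Only the small fused head is trainable; a least-squares fit stands in
# for training it on the (sparse) multi-property targets.
X = rng.normal(size=(100, n_features))
y = rng.normal(size=(100, 2))          # two target properties
Z = fused_features(X)                  # (100, d_a + d_b) fused representation
head, *_ = np.linalg.lstsq(Z, y, rcond=None)
pred = Z @ head
print(Z.shape, head.shape, pred.shape)
```

Because the encoders stay frozen, the only trainable parameters are those of the head, which is what keeps the fused model cheap to extend to new properties.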


In this Supplementary Material, we first present details of the Shapley value sampling (Appendix A).

Neural Information Processing Systems

In this section, we introduce the details of the Shapley value sampling. We sample the Shapley value for models trained on CIFAR10, CIFAR100, and ImageNet. For CIFAR10 and CIFAR100, we employ ResNet-18 models and train them ourselves. For ImageNet, we employ the standard ResNet-50 provided in robustbench [11]. We demonstrate additional Shapley value quantification results on ImageNet and TinyImageNet in Figures 1 and 2.
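Monte Carlo Shapley value sampling averages a player's marginal contribution over random orderings. The sketch below uses a toy additive game (a weighted set sum) as a stand-in for the model-accuracy games sampled in the paper; the value function and weights are fabricated for illustration.

```python
import random

random.seed(0)

players = [0, 1, 2]
weights = {0: 3.0, 1: 1.0, 2: 2.0}

def v(coalition):
    # Additive toy game: a coalition's value is the sum of member weights.
    return sum(weights[p] for p in coalition)

def shapley_sample(player, n_perms=2000):
    # Estimate the Shapley value by averaging the player's marginal
    # contribution v(S + {player}) - v(S) over random permutations.
    total = 0.0
    for _ in range(n_perms):
        perm = players[:]
        random.shuffle(perm)
        idx = perm.index(player)
        before = perm[:idx]
        total += v(before + [player]) - v(before)
    return total / n_perms

estimates = {p: shapley_sample(p) for p in players}
print(estimates)   # for an additive game, each estimate equals the weight
```

For an additive game the marginal contribution is the same in every ordering, so the estimates match the weights exactly; for non-additive value functions (such as model accuracy) the average over permutations is what the sampling approximates.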



End-to-End Speech Translation for Low-Resource Languages Using Weakly Labeled Data

Pothula, Aishwarya, Akkiraju, Bhavana, Bandarupalli, Srihari, D, Charan, Kesiraju, Santosh, Vuppala, Anil Kumar

arXiv.org Artificial Intelligence

The scarcity of high-quality annotated data presents a significant challenge in developing effective end-to-end speech-to-text translation (ST) systems, particularly for low-resource languages. This paper explores the hypothesis that weakly labeled data can be used to build ST models for low-resource language pairs. We constructed speech-to-text translation datasets with the help of bitext mining using state-of-the-art sentence encoders. We mined the multilingual Shrutilipi corpus to build Shrutilipi-anuvaad, a dataset comprising ST data for language pairs Bengali-Hindi, Malayalam-Hindi, Odia-Hindi, and Telugu-Hindi. We created multiple versions of training data with varying degrees of quality and quantity to investigate the effect of quality versus quantity of weakly labeled data on ST model performance. Results demonstrate that ST systems can be built using weakly labeled data, with performance comparable to massive multi-modal multilingual baselines such as SONAR and SeamlessM4T.
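Bitext mining with sentence encoders boils down to scoring candidate cross-lingual pairs by embedding similarity and keeping those above a threshold, which is also where the quality-versus-quantity trade-off lives. A minimal sketch, with 2-D "embeddings" fabricated for illustration rather than produced by a real encoder:

```python
import numpy as np

src = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])   # source-language sentence vectors
tgt = np.array([[0.9, 0.1], [0.1, 0.9], [0.7, 0.6]])   # target-language sentence vectors

def normalize(m):
    return m / np.linalg.norm(m, axis=1, keepdims=True)

sim = normalize(src) @ normalize(tgt).T   # cosine similarity matrix
best = sim.argmax(axis=1)                 # best target candidate per source
scores = sim.max(axis=1)

# Keep only pairs above a quality threshold: raising it yields fewer but
# cleaner pairs, lowering it yields more but noisier ones.
threshold = 0.95
mined = [(i, int(best[i])) for i in range(len(src)) if scores[i] >= threshold]
print(mined)
```

Real mining pipelines typically add margin-based scoring over nearest neighbours rather than raw cosine, but the thresholded-similarity skeleton is the same.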


Different Speech Translation Models Encode and Translate Speaker Gender Differently

Fucci, Dennis, Gaido, Marco, Negri, Matteo, Bentivogli, Luisa, Martins, Andre, Attanasio, Giuseppe

arXiv.org Artificial Intelligence

Recent studies on interpreting the hidden states of speech models have shown their ability to capture speaker-specific features, including gender. Does this finding also hold for speech translation (ST) models? If so, what are the implications for the speaker's gender assignment in translation? We address these questions from an interpretability perspective, using probing methods to assess gender encoding across diverse ST models. Results on three language directions (English-French/Italian/Spanish) indicate that while traditional encoder-decoder models capture gender information, newer architectures -- integrating a speech encoder with a machine translation system via adapters -- do not. We also demonstrate that low gender encoding capabilities result in systems' tendency toward a masculine default, a translation bias that is more pronounced in newer architectures.
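Probing means fitting a small supervised classifier on frozen hidden states to test whether an attribute is linearly decodable from them. The sketch below uses synthetic two-class "hidden states" (with the label injected along one dimension by construction) and a dependency-free logistic-regression probe; none of this is the paper's data or probe configuration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 400, 16
labels = rng.integers(0, 2, size=n)
# Synthetic hidden states that, by construction, encode the label along dim 0.
states = rng.normal(size=(n, d)) + np.outer(labels * 2.0 - 1.0, np.eye(d)[0]) * 1.5

# Train a linear probe with full-batch logistic-regression gradient descent.
w = np.zeros(d)
b = 0.0
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(states @ w + b)))   # predicted probabilities
    w -= 0.5 * (states.T @ (p - labels) / n)
    b -= 0.5 * float(np.mean(p - labels))

acc = float(np.mean((states @ w + b > 0) == labels))
print(f"probe accuracy: {acc:.2f}")
```

High probe accuracy indicates the attribute is linearly encoded in the states; near-chance accuracy, as the paper reports for adapter-based architectures, indicates it is not.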


Addressing speaker gender bias in large scale speech translation systems

Bansal, Shubham, Joshi, Vikas, Chadha, Harveen, Mehta, Rupeshkumar, Li, Jinyu

arXiv.org Artificial Intelligence

This study addresses the issue of speaker gender bias in Speech Translation (ST) systems, which can lead to offensive and inaccurate translations. The masculine bias often found in large-scale ST systems is typically perpetuated through training data derived from Machine Translation (MT) systems. Our approach involves two key steps. First, we employ Large Language Models (LLMs) to rectify translations based on the speaker's gender in a cost-effective manner. Second, we fine-tune the ST model with the corrected data, enabling the model to generate gender-specific translations directly from audio cues, without the need for explicit gender input. Additionally, we propose a three-mode fine-tuned model for scenarios where the speaker's gender is either predefined or should not be inferred from speech cues. We demonstrate a 70% improvement in translations for female speakers compared to our baseline and other large-scale ST systems, such as Seamless M4T and Canary, on the MuST-SHE test set.
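The three-mode idea, letting the model infer gender from audio, use a provided gender, or stay neutral, can be pictured as a control tag prepended to the model input. The tag names and interface below are illustrative assumptions, not the paper's actual tokens.

```python
MODES = {"infer", "explicit", "neutral"}

def build_st_input(audio_id, mode, gender=None):
    # Prepend a control tag selecting one of three translation modes
    # (hypothetical tag vocabulary for illustration).
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    if mode == "explicit":
        if gender not in {"female", "male"}:
            raise ValueError("explicit mode requires a gender")
        tag = f"<gender:{gender}>"      # gender predefined by the caller
    elif mode == "neutral":
        tag = "<gender:neutral>"        # gender must not be inferred
    else:
        tag = "<gender:infer>"          # model falls back to audio cues
    return f"{tag} {audio_id}"

print(build_st_input("utt_001", "explicit", "female"))
```

Fine-tuning on data tagged this way is what lets a single model serve all three scenarios at inference time.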


Isochrony-Controlled Speech-to-Text Translation: A study on translating from Sino-Tibetan to Indo-European Languages

Yousefi, Midia, Qian, Yao, Chen, Junkun, Wang, Gang, Liu, Yanqing, Wang, Dongmei, Wang, Xiaofei, Xue, Jian

arXiv.org Artificial Intelligence

End-to-end speech translation (ST), which translates source language speech directly into target language text, has garnered significant attention in recent years. Many ST applications require strict length control to ensure that the translation duration matches the length of the source audio, including both speech and pause segments. Previous methods often controlled the number of words or characters generated by the Machine Translation model to approximate the source sentence's length without considering the isochrony of pauses and speech segments, as duration can vary between languages. To address this, we present improvements to the duration alignment component of our sequence-to-sequence ST model. Our method controls translation length by predicting the duration of speech and pauses in conjunction with the translation process. This is achieved by providing timing information to the decoder, ensuring it tracks the remaining duration for speech and pauses while generating the translation. Evaluation on the Zh-En test set of CoVoST 2 demonstrates that the proposed Isochrony-Controlled ST achieves 0.92 speech overlap and 8.9 BLEU, only a 1.4 BLEU drop compared to the ST baseline.
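The budget-tracking intuition, where the decoder knows the remaining speech and pause time and stops or skips accordingly, can be sketched as follows. The per-token duration table and greedy loop are fabricated stand-ins for the model's learned duration predictor and decoder.

```python
# Hypothetical per-token durations in seconds (a learned predictor in the paper).
token_duration = {"hello": 0.4, "there": 0.35, "friend": 0.5, "<pause>": 0.3}

def generate_within_budget(candidates, speech_budget, pause_budget):
    # Stand-in for decoding: consume candidate tokens while tracking the
    # remaining duration budget for speech and for pauses separately.
    out, speech_left, pause_left = [], speech_budget, pause_budget
    for tok in candidates:
        cost = token_duration[tok]
        if tok == "<pause>":
            if cost > pause_left:
                continue               # skip pauses that overrun the pause budget
            pause_left -= cost
        else:
            if cost > speech_left:
                break                  # no room left for more speech
            speech_left -= cost
        out.append(tok)
    return out, round(speech_left, 2), round(pause_left, 2)

print(generate_within_budget(["hello", "<pause>", "there", "friend", "friend"], 1.0, 0.3))
```

In the actual model the remaining budgets are fed to the decoder as timing features rather than enforced by a hard loop, but the bookkeeping is the same.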


CTC-GMM: CTC guided modality matching for fast and accurate streaming speech translation

Zhao, Rui, Li, Jinyu, Fan, Ruchao, Post, Matt

arXiv.org Artificial Intelligence

Models for streaming speech translation (ST) can achieve high accuracy and low latency if they're developed with vast amounts of paired audio in the source language and written text in the target language. Yet, these text labels for the target language are often pseudo labels due to the prohibitive cost of manual ST data labeling. In this paper, we introduce a methodology named Connectionist Temporal Classification guided modality matching (CTC-GMM) that enhances the streaming ST model by leveraging extensive machine translation (MT) text data. This technique employs CTC to compress the speech sequence into a compact embedding sequence that matches the corresponding text sequence, allowing us to utilize matched {source-target} language text pairs from the MT corpora to refine the streaming ST model further. Our evaluations with FLEURS and CoVoST2 show that the CTC-GMM approach can increase translation accuracy relatively by 13.9% and 6.4% respectively, while also boosting decoding speed by 59.7% on GPU.
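The CTC compression step has a simple core: collapse repeated frame-level labels and drop blanks, shrinking a long frame sequence toward text-like length so it can be matched against MT data. A minimal sketch with illustrative character labels (the real model compresses embedding sequences, not characters):

```python
BLANK = "_"

def ctc_collapse(frames):
    # Standard CTC collapse: merge consecutive repeats, then drop blanks.
    # A blank between two identical labels keeps them as separate tokens.
    out, prev = [], None
    for f in frames:
        if f != prev and f != BLANK:
            out.append(f)
        prev = f
    return out

frames = ["_", "h", "h", "_", "e", "e", "e", "l", "_", "l", "o", "_"]
print(ctc_collapse(frames))   # → ['h', 'e', 'l', 'l', 'o']
```

Note the blank between the two "l" frames: without it the repeats would merge into a single "l", which is why CTC needs the blank symbol at all.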


Optimizing Rare Word Accuracy in Direct Speech Translation with a Retrieval-and-Demonstration Approach

Li, Siqi, Liu, Danni, Niehues, Jan

arXiv.org Artificial Intelligence

Direct speech translation (ST) models often struggle with rare words. Incorrect translation of these words can have severe consequences, impacting translation quality and user trust. While rare word translation is inherently challenging for neural models due to sparse learning signals, real-world scenarios often allow access to translations of past recordings on similar topics. To leverage these valuable resources, we propose a retrieval-and-demonstration approach to enhance rare word translation accuracy in direct ST models. First, we adapt existing ST models to incorporate retrieved examples for rare word translation, which allows the model to benefit from prepended examples, similar to in-context learning. We then develop a cross-modal (speech-to-speech, speech-to-text, text-to-text) retriever to locate suitable examples. We demonstrate that standard ST models can be effectively adapted to leverage examples for rare word translation, improving rare word translation accuracy over the baseline by 17.6% with gold examples and 8.5% with retrieved examples. Moreover, our speech-to-speech retrieval approach outperforms other modalities and exhibits higher robustness to unseen speakers. Our code is publicly available (https://github.com/SiqiLii/Retrieve-and-Demonstration-ST).
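The retrieval-and-demonstration loop, fetching the most similar past example and prepending it to the current input, can be sketched as below. The toy vectors, example texts, and separator format are fabricated for illustration and are not the paper's actual retriever or input format.

```python
import numpy as np

# Hypothetical memory of past (embedding, (source, translation)) examples.
memory = [
    (np.array([0.9, 0.1]), ("audio: anechoic chamber", "translation: chambre anechoique")),
    (np.array([0.1, 0.9]), ("audio: weather report", "translation: bulletin meteo")),
]

def retrieve(query_vec):
    # Nearest example by cosine similarity (cross-modal in the paper:
    # speech-to-speech retrieval worked best).
    sims = [float(q @ query_vec / (np.linalg.norm(q) * np.linalg.norm(query_vec)))
            for q, _ in memory]
    return memory[int(np.argmax(sims))][1]

def build_input(query_vec, query_text):
    demo_src, demo_tgt = retrieve(query_vec)
    # Prepend the retrieved example so the model can copy the rare term,
    # analogous to in-context learning.
    return f"{demo_src} {demo_tgt} || {query_text}"

print(build_input(np.array([0.8, 0.2]), "audio: test in the anechoic chamber"))
```

The model itself must be adapted to exploit prepended examples; retrieval alone does not help an unmodified ST model.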


Compact Speech Translation Models via Discrete Speech Units Pretraining

Lam, Tsz Kin, Birch, Alexandra, Haddow, Barry

arXiv.org Artificial Intelligence

We propose a pretraining method that uses a Self-Supervised Speech (SSS) model to create more compact speech-to-text translation models. In contrast to using the SSS model for initialization, our method is better suited to memory-constrained scenarios such as on-device deployment. It is based on Discrete Speech Units (DSU) extracted from the SSS model. In the first step, our method pretrains two smaller encoder-decoder models on 1) Filterbank-to-DSU (Fbk-to-DSU) and 2) DSU-to-Translation (DSU-to-Trl) data, respectively; the DSU thus become the distillation inputs of the smaller models. Subsequently, the encoder from the Fbk-to-DSU model and the decoder from the DSU-to-Trl model are taken to initialise the compact model. Finally, the compact model is finetuned on the paired Fbk-Trl data. In addition to being compact, our method requires no transcripts, making it applicable to low-resource settings. It also avoids speech discretization at inference and is more robust to the DSU tokenization. Evaluation on CoVoST-2 (X-En) shows that our method consistently improves over the baseline on three metrics while being compact, i.e., only half the SSS model size.
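Discrete Speech Units are typically obtained by clustering frame-level features from the SSS model and labelling each frame with its cluster id. The sketch below quantizes synthetic frame features with a tiny hand-rolled k-means; the data, deterministic initialization, and two-unit codebook are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic "SSS features": two well-separated clusters of 4-D frames.
frames = np.vstack([rng.normal(0.0, 0.2, size=(50, 4)),
                    rng.normal(3.0, 0.2, size=(50, 4))])

k = 2
# Deterministic initialization so each cluster starts non-empty.
centroids = np.array([[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]])
for _ in range(10):                                        # Lloyd's iterations
    dists = np.linalg.norm(frames[:, None, :] - centroids[None, :, :], axis=2)
    units = dists.argmin(axis=1)                           # DSU id per frame
    centroids = np.stack([frames[units == j].mean(axis=0) for j in range(k)])

print(sorted(np.bincount(units).tolist()))                 # frames per unit
```

The resulting per-frame unit ids are what the Fbk-to-DSU model learns to predict and the DSU-to-Trl model consumes; at inference the compact model skips this discretization entirely.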